
Conversation


@filip-michalsky commented Aug 12, 2025

Why

Add WebVoyager and GAIA evaluation suites to benchmark Stagehand's web navigation and reasoning capabilities against industry-standard datasets.

What Changed

  • Added WebVoyager eval suite with 643 test cases for web navigation tasks
  • Added GAIA eval suite with 90 test cases for general AI assistant tasks
  • Refactored eval infrastructure to support sampling and filtering
  • Created reusable utilities for JSONL parsing and test case generation (a minimal sketch follows this list)
  • Added configuration for new eval suites in evals.config.json
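
A minimal sketch of what the JSONL parsing and test-case generation utilities might look like. The real helpers live in evals/utils.ts and the suite builders; the function names and the row shape below are illustrative assumptions, not the exact schema:

  import * as fs from "fs";

  // Illustrative shape of a WebVoyager dataset row; the field names here are
  // an assumption, not the exact schema shipped in the dataset file.
  interface WebVoyagerRow {
    id: string;
    web: string; // start URL
    ques: string; // task instruction
  }

  // Parse a JSONL file into typed rows, skipping blank lines.
  function readJsonl<T>(path: string): T[] {
    return fs
      .readFileSync(path, "utf-8")
      .split("\n")
      .filter((line) => line.trim().length > 0)
      .map((line) => JSON.parse(line) as T);
  }

  // Turn dataset rows into test cases the eval runner can execute.
  function buildWebVoyagerCases(path: string) {
    return readJsonl<WebVoyagerRow>(path).map((row) => ({
      name: `webvoyager/${row.id}`,
      params: { startUrl: row.web, task: row.ques },
    }));
  }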

Environment Variables

  • EVAL_WEBVOYAGER_SAMPLE: Random sample size from WebVoyager dataset
  • EVAL_WEBVOYAGER_LIMIT: Max cases to run (default: 25)
  • EVAL_GAIA_SAMPLE: Random sample size from GAIA dataset
  • EVAL_GAIA_LIMIT: Max cases to run (default: 25)
  • EVAL_GAIA_LEVEL: Filter GAIA by difficulty level (1, 2, or 3)

Sampling Strategy

The sampling implementation uses a Fisher-Yates shuffle for unbiased random selection when SAMPLE is specified; otherwise it takes the first LIMIT cases. This allows for both deterministic (first N) and randomized (sample N) test runs.
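
A sketch of how this selection might be wired to the environment variables above, with assumed helper names; the actual implementation lives in evals/utils.ts and the suite builders:

  // Fisher-Yates shuffle over a copy, so the source array is not mutated.
  function shuffle<T>(items: T[]): T[] {
    const out = [...items];
    for (let i = out.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [out[i], out[j]] = [out[j], out[i]];
    }
    return out;
  }

  // Random sample of `sample` cases when set, otherwise the first `limit` cases.
  function selectCases<T>(cases: T[], sample?: number, limit = 25): T[] {
    if (sample && sample > 0) {
      return shuffle(cases).slice(0, sample);
    }
    return cases.slice(0, limit);
  }

  // Hypothetical wiring for GAIA; the env-var names match the list above, but
  // this harness is illustrative rather than the real runner code.
  const allGaiaCases: { id: string }[] = []; // loaded from the JSONL dataset in practice
  const sample = Number(process.env.EVAL_GAIA_SAMPLE) || undefined;
  const limit = Number(process.env.EVAL_GAIA_LIMIT) || 25;
  const selected = selectCases(allGaiaCases, sample, limit);
  console.log(`running ${selected.length} GAIA cases`);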

Test Plan

# Test WebVoyager with OpenAI
EVAL_SUITE=webvoyager EVAL_WEBVOYAGER_SAMPLE=1 EVAL_MODEL=openai/gpt-4o-computer-use-preview pnpm run evals

# Test WebVoyager with Claude
EVAL_SUITE=webvoyager EVAL_WEBVOYAGER_SAMPLE=1 EVAL_MODEL=anthropic/claude-3-5-sonnet-20241022 pnpm run evals

# Test GAIA with OpenAI
EVAL_SUITE=gaia EVAL_GAIA_SAMPLE=1 EVAL_MODEL=openai/gpt-4o-computer-use-preview pnpm run evals

# Test GAIA with Claude
EVAL_SUITE=gaia EVAL_GAIA_SAMPLE=1 EVAL_MODEL=anthropic/claude-3-5-sonnet-20241022 pnpm run evals

# Verify existing evals still work
pnpm run evals


changeset-bot bot commented Aug 12, 2025

🦋 Changeset detected

Latest commit: 51246f6

The changes in this PR will be included in the next version bump.


@filip-michalsky marked this pull request as ready for review August 15, 2025 00:52

@greptile-apps bot left a comment


Greptile Summary

This PR adds two industry-standard evaluation suites to benchmark Stagehand's web automation capabilities: WebVoyager (643 test cases) and GAIA (90 test cases). The changes significantly expand the evaluation infrastructure to support data-driven benchmarking against established datasets.

The core architectural change introduces a suite-based evaluation system alongside the existing task-based approach. New suite builders (evals/suites/webvoyager.ts and evals/suites/gaia.ts) read JSONL dataset files and dynamically generate test cases, while corresponding task implementations (evals/tasks/agent/webvoyager.ts and evals/tasks/agent/webarena_gaia.ts) execute the actual evaluations. The system supports flexible sampling strategies using Fisher-Yates shuffle for randomized selection or deterministic first-N selection.

Key infrastructure improvements include:

  • A new core/summary.ts module that extracts summary generation logic into a reusable component
  • Enhanced type system with optional taskParams and params fields to pass dataset-specific parameters to evaluation functions
  • New utility functions for JSONL parsing, data validation, and sampling in evals/utils.ts
  • Environment variable configuration for controlling test execution (sample sizes, limits, difficulty levels)
  • Updated evaluation runner logic in index.eval.ts to handle both static tasks and dynamic dataset-driven evaluations

The datasets themselves are substantial additions: WebVoyager contains 643 web navigation tasks across 13+ websites (Amazon, Google services, GitHub, etc.), while GAIA provides 90 general AI assistant tasks with varying difficulty levels. Both datasets start from standardized URLs and expect structured response formats.

This integration maintains full backward compatibility with existing evaluations while providing the foundation for systematic benchmarking against industry standards. The sampling capabilities allow for both development testing (small samples) and comprehensive evaluation runs.

Confidence score: 4/5

  • This PR is safe to merge with minimal risk as it maintains backward compatibility and adds well-structured evaluation capabilities
  • Score reflects solid implementation patterns and comprehensive infrastructure changes, though there's a potential division-by-zero edge case in summary generation
  • Pay close attention to evals/core/summary.ts for the division-by-zero issue in the category success rate calculation (a guard sketch follows below)
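
For reference, a hedged sketch of the kind of guard that addresses that edge case; the function name and result shape are illustrative, not the actual evals/core/summary.ts API:

  interface EvalResult {
    category: string;
    success: boolean;
  }

  // Success rate for one category, guarding the empty case so a category with
  // no results reports 0 instead of NaN from a division by zero.
  function categorySuccessRate(results: EvalResult[], category: string): number {
    const inCategory = results.filter((r) => r.category === category);
    if (inCategory.length === 0) return 0;
    const passed = inCategory.filter((r) => r.success).length;
    return passed / inCategory.length;
  }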

12 files reviewed, 3 comments


filip-michalsky and others added 3 commits August 20, 2025 20:08
Resolved conflicts by merging agent task configurations and including both taskParams and agent properties in StagehandInitResult interface.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

Resolved conflicts by merging agent task configurations and including both taskParams and agent properties in StagehandInitResult interface.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

Add agent evaluation support to CI pipeline
@filip-michalsky
Collaborator Author

Added a CI category for external agent benchmarks:


  # Run all external benchmarks (both GAIA and WebVoyager)
  pnpm evals category external_agent_benchmarks env=LOCAL trials=1

  # Run only GAIA with max 10 test cases
  pnpm evals category external_agent_benchmarks --dataset=gaia max_k=10 env=LOCAL trials=1

  # Run only WebVoyager with max 5 test cases, 2 trials each
  pnpm evals category external_agent_benchmarks --dataset=webvoyager max_k=5 env=LOCAL trials=2

  # Backward compatible - run specific benchmark by name
  pnpm run evals name=agent/gaia api=false trials=1 max_k=10

@@ -39,6 +41,13 @@ for (const arg of rawArgs) {
    }
  } else if (arg.startsWith("provider=")) {
    parsedArgs.provider = arg.split("=")[1]?.toLowerCase();
  } else if (arg.startsWith("--dataset=")) {
    parsedArgs.dataset = arg.split("=")[1]?.toLowerCase();
  } else if (arg.startsWith("max_k=")) {
Collaborator

Note for later: we should make this arg a bit more intuitive (something along the lines of "max number of evals").
